Methods and Apparatus for Assessing Web Page Decay

ABSTRACT

Systems and methods are herein disclosed for assessing the staleness of a web page. In particular, in one method of the present invention, the staleness of a web page is assessed by examining internal date references within the web page. In another method of the present invention, the staleness of a web page is assessed by examining the meta-data associated with the web page. In a further method of the present invention, the staleness of a hyperlinked web page is determined by examining the link status of the hyperlinks. If the web page has a relatively large number of dead links, it is assessed as being a stale web page. In a still further method of the present invention, the link status of web pages in the neighborhood of the web page being assessed is likewise examined.

TECHNICAL FIELD

The present invention generally concerns web pages and more particularly concerns methods and apparatus for assessing the decay of web pages.

BACKGROUND

The rapid growth of the web has been noted and tracked extensively. Recent studies, however, have documented the dual phenomenon: web pages often have small half-lives, and thus the web exhibits rapid decay as well. Consequently, page creators are faced with an increasingly burdensome task of keeping links up to date, and many fall behind. In addition to individual pages, collections of pages or even entire neighborhoods on the web exhibit significant decay, rendering them less effective as information resources. Such neighborhoods are identified by frustrated searchers seeking a way out of these stale neighborhoods, back to more up-to-date sections of the web.

On Nov. 2, 2003, the Associated Press reported that the “Internet [is] littered with abandoned sites.” [20] The story was picked up by many news outlets, from the USA's CNN to Singapore's Straits Times. The article further states that “[d]espite the Internet's ability to deliver information quickly and frequently, the World Wide Web is littered with deadwood—sites abandoned and woefully out of date.”

Of course this is not news to most net-denizens, and speed of delivery has nothing to do with the quality of content, but there is no denying that the increase in the number of outdated sites has made finding reliable information on the web even more difficult and frustrating. Part of the problem is an issue of perception: the immediacy and flexibility of the web create the expectation that the content is up-to-date; after all, in a library no one expects every book to be current, but, on the other hand, it is clear that books once published do not change, and it is fairly easy to find the publication date.

While there have been substantial efforts in mapping and understanding the growth of the web, there have been fewer investigations of its death and decay. Determining whether a URL is dead or alive is quite easy, at least to a first approximation, and, in fact, it is known that web pages disappear at a rate of 0.25-0.5% per week. However, determining whether a web page has been abandoned is much more difficult.

Thus, those skilled in the art desire a method for assessing the decay status or “staleness” of a web page. In addition, those skilled in the art desire methods for assessing the staleness of a web page so that the method can be used as a way of ranking web pages. Further, those skilled in the art desire methods and apparatus for use in web maintenance activities. Methods and apparatus that accurately assess the staleness of web pages are particularly useful in managing web maintenance activities.

SUMMARY OF THE PREFERRED EMBODIMENTS

A first alternate embodiment of the present invention comprises a signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus of a computer system to perform operations for assessing the currency of a web page, the operations comprising: establishing a date threshold, wherein web pages older than the date threshold will be assessed as not being current; accessing a web page; extracting date information from the web page identifying the age of the web page; and comparing the date information extracted from the web page to the date threshold.

A second alternate embodiment of the present invention comprises a signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus of a computer system to perform operations for assessing the currency of a web page, the operations comprising: receiving a user-specified topicality threshold, where the topicality threshold concerns the topicality of material content of the web page; accessing a web page; extracting topicality information from the web page; and comparing the topicality information extracted from the web page to the topicality threshold.

A third alternate embodiment of the present invention comprises: a signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus of a computer system to perform operations for assessing the currency of a web page, the operations comprising: establishing a link threshold, wherein a web page will be assessed as lacking currency if a percentage of hyperlinks contained in the web page that link to an active page is less than the link threshold; accessing a web page containing hyperlinks; testing the hyperlinks; calculating the percentage of hyperlinks that return active web pages; and comparing the percentage of hyperlinks that return active web pages with the link threshold.

A fourth alternate embodiment of the present invention comprises: a signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus of a computer system to perform operations for assessing the decay of a web page, the operations comprising: accessing a subject web page containing hyperlinks; assessing the decay of the subject web page by following a random walk away from the subject web page, where the random walk consists of a testing of links on the subject web page and web pages linked to the subject web page under test; and assigning a decay score to the subject web page in dependence on dead links encountered in the random walk, wherein the decay score is a weighted sliding scale, where a dead link encountered relatively close in the random walk to the subject web page in terms of intermediate web pages results in a higher decay score than a dead link encountered relatively farther away from the subject web page.

A fifth alternate embodiment of the present invention comprises: a computer system for assessing the currency of a web page, the computer system comprising: an internet connection for connecting to the internet and for accessing web pages available on the internet; at least one memory to store web pages retrieved from the internet and at least one program of machine-readable instructions, where the at least one program performs operations to assess the currency of a web page; at least one processor coupled to the internet connection and the at least one memory, where the at least one processor performs the following operations when the at least one program is executed: retrieving a date threshold, wherein web pages older than the date threshold will be assessed as not being current; accessing a web page; extracting date information from the web page identifying the age of the web page; and comparing the date information extracted from the web page to the date threshold.

A sixth alternate embodiment of the present invention comprises: a computer system for assessing the currency of a web page, the computer system comprising: an internet connection for connecting to the internet and for accessing web pages available on the internet; at least one memory to store web pages retrieved from the internet and at least one program of machine-readable instructions, where the at least one program performs operations to assess the currency of a web page; at least one processor coupled to the internet connection and the at least one memory, where the at least one processor performs the following operations when the at least one program is executed: retrieving a predetermined topicality threshold, where the topicality threshold concerns the topicality of material comprising a web page; extracting topicality information from the web page; and comparing the topicality information extracted from the web page to the topicality threshold.

A seventh alternate embodiment of the present invention comprises: a computer system for assessing the currency of a web page, the computer system comprising: an internet connection for connecting to the internet and for accessing web pages available on the internet; at least one memory to store web pages retrieved from the internet and at least one program of machine-readable instructions, where the at least one program performs operations to assess the currency of a web page; at least one processor coupled to the internet connection and the at least one memory, where the at least one processor performs the following operations when the at least one program is executed: establishing a link threshold, wherein a web page will be assessed as lacking currency if a percentage of hyperlinks contained in the web page that link to an active page is less than the link threshold; accessing a web page containing hyperlinks; testing the hyperlinks; calculating the percentage of hyperlinks that return active web pages; and comparing the percentage of hyperlinks that return active web pages with the link threshold.

An eighth alternate embodiment of the present invention comprises: a computer system for assessing the decay of a web page, the computer system comprising: an internet connection for connecting to the internet and for accessing web pages available on the internet; at least one memory to store web pages retrieved from the internet and at least one program of machine-readable instructions, where the at least one program performs operations to assess the decay of a web page; at least one processor coupled to the internet connection and the at least one memory, where the at least one processor performs the following operations when the at least one program is executed: accessing a subject web page containing hyperlinks; assessing the decay of the subject web page by following a random walk away from the subject web page, where the random walk consists of a testing of links on the subject web page and web pages linked to the subject web page under test; and assigning a decay score to the subject web page in dependence on dead links encountered in the random walk, wherein the decay score is a weighted sliding scale, where a dead link encountered relatively close in the random walk to the subject web page in terms of intermediate web pages results in a higher decay score than a dead link encountered relatively farther away from the subject web page.

Thus it is seen that embodiments of the present invention overcome the limitations of the prior art. In particular, in the prior art there was no known way to assess the currency of a web page. In contrast, the apparatus and methods of the present invention provide a reliable and accurate method for assessing the currency of a web page.

The methods and apparatus of the present invention are particularly useful in combination with web ranking and enterprise web management applications. In web ranking situations, it is not desirable to assign a high ranking to a web page that is grossly out of date. Accordingly, having an accurate assessment of the currency of a web page is one factor that may be used in ranking a particular web page.

In enterprise web management situations, proprietors of web-based services wish to continually assess the currency of the web pages constituting their web-based services. Thus, methods and apparatus that can accurately assess the currency of web pages are particularly useful in managing maintenance activities.

In conclusion, the foregoing summary of the alternate embodiments of the present invention is exemplary and non-limiting. For example, one of ordinary skill in the art will understand that one or more aspects or steps from one alternate embodiment can be combined with one or more aspects or steps from another alternate embodiment to create a new embodiment within the scope of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other aspects of these teachings are made more evident in the following Detailed Description of the Preferred Embodiments, when read in conjunction with the attached Drawing Figures, wherein:

FIG. 1 is a flowchart depicting the steps of a method operating in accordance with an embodiment of the present invention;

FIG. 2 is a flowchart depicting the steps of a method operating in accordance with an embodiment of the present invention;

FIG. 3 is a flowchart depicting the steps of a method operating in accordance with an embodiment of the present invention;

FIG. 4 is a flowchart depicting the steps of a method operating in accordance with one embodiment of the present invention;

FIG. 5 depicts a block diagram of a computer system suitable for practicing the methods and apparatus of the present invention;

FIG. 6 is a flowchart depicting the steps of a method operating in accordance with an embodiment of the present invention;

FIG. 7 is a flowchart depicting the steps of a method operating in accordance with an embodiment of the present invention;

FIG. 8 is a graph depicting the distribution of the fraction of dead links and decay scores for various σ's;

FIG. 9 is a scatter plot of decay scores versus the fraction of dead links;

FIG. 10 is a graph depicting the average decay score and fraction of dead links for papers from the last ten WWW conferences;

FIG. 11 depicts the average decay scores and fraction of dead links for 30 Yahoo! nodes; and

FIG. 12 depicts the average decay scores and fraction of dead links for FAQs.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

A method for assessing the currency of a web page operating in accordance with the present invention is depicted in FIG. 1. In step 10, a date threshold is established, where web pages older than the date threshold will be assessed as not being current. Next, at step 11, a web page is accessed over the internet. Then, at step 12, date information is extracted from the web page identifying the age of the web page. In alternate embodiments, the “last-modified” information may be extracted from the web page, indicating when the web page was last modified. Next, at step 13, the date information extracted from the web page is compared to the date threshold. If the date information extracted from the web page is older than the date threshold, the web page is assessed as lacking currency; if it is younger, the web page is assessed as being current.
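By way of illustration only, the following minimal sketch (in Python, using only the standard library) shows one way the FIG. 1 comparison could be realized when the date information is taken from the HTTP “Last-Modified” header; the URL and the two-year threshold in the example are hypothetical, not part of the specification.

    from datetime import datetime, timedelta, timezone
    from email.utils import parsedate_to_datetime
    from typing import Optional
    from urllib.request import urlopen

    def is_current(url: str, threshold: timedelta) -> Optional[bool]:
        """Compare a page's Last-Modified date against a date threshold."""
        with urlopen(url, timeout=10) as response:
            last_modified = response.headers.get("Last-Modified")
        if last_modified is None:
            return None  # no date information could be extracted
        page_date = parsedate_to_datetime(last_modified)
        return datetime.now(timezone.utc) - page_date <= threshold

    # Example: assess pages untouched for more than two years as not current.
    # is_current("http://www.example.com/", timedelta(days=730))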

Another method operating in accordance with the present invention is depicted in FIG. 2. At step 20, the method receives a user-specified topicality threshold, where the topicality threshold concerns the topicality of material content of the web page. As used herein, “topical” means a web page whose content or subject matter is current or up-to-date; a web page the content of which is not “topical” is outmoded or out-of-date. The topicality threshold can be specified in a number of ways. For example, the topicality threshold can concern date references within the content of the web page. Alternatively, the topicality threshold can concern historical events the presence of which would indicate that the page is out of date. Further, the topicality threshold can be set by using product identifiers. If a researcher sought to assess the currency of web pages concerning computer hardware, the researcher could use product identifiers as indicia of whether a web page is up to date. For example, a web page discussing Pentium III processors for non-historical reasons would be out of date. Then, at step 21, a web page is accessed over the internet. Next, at step 22, topicality information is extracted from the web page. Then, at step 23, the topicality information extracted from the web page is compared to the topicality threshold. If the comparison reveals that the information extracted from the web page lacks topicality when compared to the topicality threshold, the web page is assessed as lacking currency. Alternatively, if the information extracted from the web page is topical when compared to the topicality threshold, the web page is assessed as being current.
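As an illustrative sketch only, the topicality comparison of FIG. 2 could be approximated with product identifiers as follows; the term list below is a hypothetical example built from the Pentium III discussion above, not a specified part of the invention.

    from urllib.request import urlopen

    # Hypothetical topicality threshold: product identifiers whose
    # non-historical mention suggests the page is out of date.
    OUTDATED_IDENTIFIERS = ["Pentium III", "Pentium II"]

    def is_topical(url: str, outdated_terms=OUTDATED_IDENTIFIERS) -> bool:
        """Assess a page as lacking topicality if it mentions an outdated term."""
        with urlopen(url, timeout=10) as response:
            text = response.read().decode("utf-8", errors="replace").lower()
        return not any(term.lower() in text for term in outdated_terms)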

A further method operating in accordance with the present invention is depicted in FIG. 3. At step 30, a link threshold is established, where a web page will be assessed as lacking currency if the percentage of hyperlinks contained in the web page that link to active web pages is less than the threshold. Next, at step 31, a web page containing hyperlinks is accessed over the internet. Then, at step 32, the hyperlinks contained in the web page are tested. Next, at step 33, the percentage of hyperlinks that return active web pages is calculated. Then, at step 34, the percentage of hyperlinks that return active web pages is compared with the link threshold. If the percentage is less than the link threshold, the web page is assessed as lacking currency; if it is greater than the link threshold, the web page is assessed as being current. One of ordinary skill in the art will understand that the link threshold could equally have been set in terms of hyperlinks that do not return active web pages.
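A minimal sketch of the FIG. 3 link-threshold test follows, assuming a naive notion of “active” (an HTTP fetch that succeeds); the soft-404 complications described in the remainder of this specification are deliberately ignored here.

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collect the href targets of anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def is_link_active(url: str) -> bool:
        try:
            with urlopen(url, timeout=10) as response:
                return response.status < 400
        except (ValueError, OSError):
            return False  # malformed URL, DNS failure, timeout, or HTTP error

    def passes_link_threshold(url: str, threshold: float) -> bool:
        """True if the fraction of active hyperlinks is at least the threshold."""
        with urlopen(url, timeout=10) as response:
            parser = LinkExtractor()
            parser.feed(response.read().decode("utf-8", errors="replace"))
        links = [urljoin(url, href) for href in parser.links]
        if not links:
            return True  # nothing to test; no evidence of decay
        active = sum(is_link_active(link) for link in links)
        return active / len(links) >= threshold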

The next aspect of the present invention concerns assessing whether a hyperlink does, in fact, link to a dead page. Dead links are the clearest giveaway of the obsolescence of a page. Indeed, this phenomenon of “link-rot” has been studied in several areas—for example, by Fetterly et al. [16] in the context of web research, by Koehler [22, 23] in the context of digital libraries, and by Markwell and Brooks [26, 27] in the context of biology education. However, using the proportion of dead links as a decay signal presents two problems.

(1) The first problem—determining whether a link is “dead”—is not trivial. According to the HTTP protocol [17], when a request is made to a server for a page that is no longer available, the server is supposed to return an error code, usually the HTTP return code 404. As discussed in the following sections, in fact many servers, including most reputable ones, do not return a 404 code—instead the servers return a substitute page and an OK code (200). The substitute page sometimes gives a written error indication, sometimes returns a redirect to the original domain home page, and sometimes returns a page which has absolutely nothing to do with the original page. Studies show that these types of substitutions, called “soft-404s,” account for more than 25% of the dead links. This issue is discussed in detail below, and a heuristic is proposed for the detection of servers that engage in soft-404s. The heuristic is effective for all cases except for one special case: a dead domain home page bought by a new entity and/or “parked” with a broker of domain names; in this special case it can be determined that the server engages in soft-404s in general, but there is no way to know whether the domain home page itself is a soft-404 or not.

(2) The second problem associated with dead links as a decay signal is that they are a very noisy signal. One reason is that the signal is easy to manipulate. Indeed, many commercial sites use content management systems and quality check systems that automatically remove any link that results in a 404 code. For example, experiments indicate that the Yahoo! taxonomy is continuously purged of any dead links. However, this is hardly an indication that every piece of the Yahoo! taxonomy is up-to-date.

Another reason for the noisiness is that pages of certain types tend to live “forever” even though no one maintains them: a typical example might be graduate students' pages—many universities allow alumni to keep their pages and e-mail addresses indefinitely as long as they do not waste too much space. Because these pages link among themselves at a relatively high rate, they will have few dead links on any one page, even long after the alumni have left the ivory towers; it is only as a larger radius is examined around these pages that a surfeit of dead links is observed.

The discussion above suggests that the measure of the decay of a page p should depend not only on the proportion of dead pages at distance 1 from p but also, and to a decreasing extent, on the proportion of dead pages at distance 2, 3, and so on.

One way to estimate these proportions is via a random walk from p: at every step, if a dead page is reached, failure is declared; otherwise, with probability σ success is declared, and with probability 1−σ the walk continues. The decay score of p, denoted D(p), is defined as the probability of failure in this walk. Thus the decay score of a page p will be some number between 0 and 1.

At first glance, this process is similar to the famous random surfer of PageRank [7]; however, they are quite different in practice: for PageRank, the importance of a page p depends recursively on the importance of the pages that point to p. In contrast, the decay of p depends recursively on the decay of the pages that are linked from p. Thus, computing the underlying recurrence once the web graph is fully explored and represented is very similar, but:

1. The decay of a given page can be approximated in isolation, that is, without having to compute the decay of all pages in the graph; hence it is a much easier task when the number of nodes of interest is relatively small.

2. While the owner of a page p has few licit means of improving its PageRank, the owner can easily reduce its decay by simply making sure that all the links on page p go to well-maintained pages.

It is generally agreed that PageRank is a better signal for the quality of a page than simply its in-degree (i.e., the number of pages that point to it), and recent studies [29, 10] have shown that the in-degree has only limited correlation with PageRank. Similar questions can be asked about the decay number versus the dead links proportion: experiments indicate that their correlation is only limited and indeed the decay number is a better indicator. For instance, on average, the set of 30 pages analyzed from the Yahoo! taxonomy have almost no dead links, but have relatively high decay, roughly the median value observable on the Web. This seems to indicate that Yahoo! has a filter that drops dead links immediately, but, on the other hand, the editors that maintain Yahoo! do not have the resources to check very often whether a page once listed continues to be as good as it was.

A dead web page is a page that is not publicly available over the web. A page can be dead for any of the following reasons: (1) its URL is malformed; (2) its host is down or non-existent; or (3) it does not exist on the host. The first two types of dead pages are easy to detect: the former fails URL parsing and the latter fails the resolution of the host address. When fetching pages that are not found on a host, the web server of the host is supposed to return an error; typically the error message returned is the 404 HTTP return code. However, it turns out that many web servers today do not return an error code even when they receive HTTP requests for non-existent pages. Instead, they return an OK code (200) and some substitute page; typically, this substitute is an error message page or the home-page of that host or even some completely unrelated page. Such non-existent pages that cause a server to issue the foregoing result are called “soft-404 pages”.

The existence of soft-404 pages makes the task of identifying dead pages non-trivial. Next to be described is an algorithm for this task operating in accordance with one embodiment of the present invention. The pseudo code for the task is reproduced in Appendix A, and a flowchart depicting the steps of the method is shown in FIG. 4. For the rest of the discussion, a web page will be identified with its URL, and the two concepts will be used interchangeably.

A soft-404 page is a non-existent page that does not result in the return of an error code. This is because the server to which the web page request was directed is programmed to issue an alternate page whenever a 404 error message would ordinarily be issued. In contrast, a hard-404 page is a non-existent page that returns an error code of 403, 404 or 410, or any error code of the form 5xx. Dead pages consist of soft-404 pages, hard-404 pages, and a few more cases such as time-outs and infinite redirects discussed below.

Let u be the URL of a page to be tested whether dead or alive. Let u.HOST denote the host of u, and let u.PARENT denote the URL of the parent directory of u. For example, both the host and the parent directory URL of http://www.ibm.com/us are http://www.ibm.com; however, the parent directory of http://www.ibm.com/us/hr is http://www.ibm.com/us. u.HOST and u.PARENT can be extracted from u by proper parsing.
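A short sketch of this parsing with the Python standard library follows (an illustration only; the specification does not mandate any particular parser), using the http://www.ibm.com/us example above:

    import posixpath
    from urllib.parse import urlsplit, urlunsplit

    def host_of(u: str) -> str:
        """u.HOST: the scheme plus network location of u."""
        parts = urlsplit(u)
        return urlunsplit((parts.scheme, parts.netloc, "", "", ""))

    def parent_of(u: str) -> str:
        """u.PARENT: the URL of the parent directory of u."""
        parts = urlsplit(u)
        parent_path = posixpath.dirname(parts.path.rstrip("/"))
        if parent_path == "/":
            parent_path = ""  # the parent of a top-level path is the host itself
        return urlunsplit((parts.scheme, parts.netloc, parent_path, "", ""))

    assert host_of("http://www.ibm.com/us") == "http://www.ibm.com"
    assert parent_of("http://www.ibm.com/us") == "http://www.ibm.com"
    assert parent_of("http://www.ibm.com/us/hr") == "http://www.ibm.com/us"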

An algorithm operating in accordance with aspects of methods and apparatus of the present invention starts by attempting to fetch u from the web (Line 3 of the function DeadPage). A fetch (step 100 in FIG. 4; see function atomicFetch) may result in one of the following three outcomes: (1) it succeeds, (2) it fails, or (3) it redirects to a different URL v. The possible reasons for failure are: (a) u is an invalid URL and could not be properly parsed (Lines 2-3 of atomicFetch); (b) the local DNS server could not resolve the IP address of u.HOST (Lines 6-7 of atomicFetch); (c) when creating a connection to u.HOST, there was no response within T seconds (in experiments T=10 was chosen) (Lines 10-11 of atomicFetch); or (d) the web server of u.HOST returns an error HTTP return code in response to the request for u (Lines 12-13 of atomicFetch). The HTTP return codes which are considered to be errors are 403 (Forbidden), 404 (Not Found), 410 (Gone), and all the codes of the form 5xx (Server errors). If a return code in these classes is returned, the algorithm concludes that the page does not exist at step 112 in FIG. 4. A success is an HTTP return code in the 2xx series or 4xx series (except for 403, 404, 410), and a redirect is indicated by an HTTP return code in the 3xx series.

Clearly, when the fetch fails, the page is dead. Next to be discussed is how to analyze the two other cases (success or redirect). The redirect case is also rather simple. An algorithm operating in accordance with the present invention attempts to fetch u. If it redirects to a new URL v, it then attempts to fetch v. It continues to follow the redirects, until reaching some URL w_(u), whose fetch results in a success or a failure (see the function fetch). (A third possibility is that the algorithm detects a loop in the redirect path (Lines 12-13 of fetch) or that the number of redirects exceeds some limit L, which is chosen to be 20 (Lines 14-15 of fetch); in such a case the algorithm declares u to be a dead page, and stops.) If the fetch of w_(u) results in a failure, u is declared a dead page as before. If the fetch results in a success (step 114 in FIG. 4), the algorithm proceeds to checking whether u is a soft-404 page.

The algorithm detects whether u is a soft-404 page or not by “learning” whether the web server of u.HOST produces soft-404s at all. This is done by asking for a page r, known with high probability not to exist on u.HOST, at step 120 in FIG. 4. It then compares the server behavior when asked for r with its behavior when asked for u.

The first question to be addressed is how to come up with a page r that is likely not to exist on u.HOST with a high probability. This is done as follows: first, a URL is chosen which has the same directory as u, and whose file name is a sequence of R random letters (in experiments R=25 was chosen; see Line 5 of DeadPage and step 120 of FIG. 4). The URL r is simply the concatenation of the URL u.PARENT with the random sequence. Since the file name is chosen at random, the probability that it exists under that directory is at most N/26^(R), where N is the number of files that do exist under the directory. For any reasonable value of N, this probability is tiny, and thus it can be safely assumed that the random page r does not exist.
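A sketch of the probe construction follows (parent_of is the helper sketched earlier; the choice of lower-case letters matches the experimental settings described later in this description):

    import random
    import string

    def random_probe_url(parent_url: str, r_len: int = 25) -> str:
        """r = u.PARENT + a file name of R random letters (R = 25 here)."""
        name = "".join(random.choices(string.ascii_lowercase, k=r_len))
        return parent_url.rstrip("/") + "/" + name

    # With R = 25 the collision probability is at most N / 26**25 for a
    # directory holding N files, which is negligible for any realistic N.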

The reason to choose r to be in the same directory as u (and not as a random page under u.HOST) is that in large hosts different directories are controlled by different web servers, and therefore may exhibit different responses to requests for non-existent pages. An example is the host http://www.ibm.com. When trying to fetch a non-existent page http://www.ibm.com/blablabla, the result is a 404 code. However, a fetch of http://www.ibm.com/us/blablabla returns the home-page http://www.ibm.com/us. Thus http://www.ibm.com/us/blablabla is a soft-404 page, but http://www.ibm.com/blablabla is a hard-404 page.

Next it is necessary to compare the behavior of the web server on r with its behavior on u. Let w_(r) and w_(u) denote the final URLs reached when following redirects from r and u, respectively. Let T_(r) and T_(u) denote the contents of w_(r) and w_(u), respectively. Let K_(r) and K_(u) denote the number of redirects the algorithm had to follow to reach w_(r) and w_(u), respectively.

If the fetch of w_(r) results in a failure, it is concluded at step 132 in FIG. 4 that the web server does not produce soft-404 pages. Since the fetch of w_(u) succeeded, the algorithm can safely declare u as alive (Lines 7-8 in DeadPage). Suppose, then, that the fetch of w_(r) results in a success. Thus, r is a soft-404 page.

If w_(r)=w_(u) and K_(r)=K_(u), then u and r are indistinguishable. This gives a clear indication that u is a soft-404 page, except for one special case: there are situations when soft-404 pages and legitimate URLs both redirect to the same final destination (for example, to the host's home-page). A good example of that is the URL http://www.cnn.de (the CNN of Germany), which redirects to http://www.n-tv.de; however, also a non-existent page like http://www.cnn.de/blablabla redirects to http://www.n-tv.de. Thus the following heuristic is used: if u is a root of a web site, then it can never be a soft-404 page (step 140 of FIG. 4 and Lines 9-10 of DeadPage; see the discussion below about when this heuristic may fail). Otherwise, at step 150, if w_(r)=w_(u) and K_(r)=K_(u), then u is declared a soft-404 page (step 152 of FIG. 4; Lines 13-14 of DeadPage).

If K_(r)≠K_(u) (step 142 in FIG. 4), the algorithm declares u to be alive (step 180) (even if w_(r)=w_(u)), because the behavior of the web server on u is different from its behavior on r (Lines 11-12 of DeadPage). An example that demonstrates that the number of redirects is crucial for the test is http://www.eurosport.de/. Fetching http://www.eurosport.de/ incurs two redirects that finally land in a valid page. However, fetching http://www.eurosport.de/blablabla redirects first to http://www.eurosport.de/ and then results in two more redirects as before. Thus, both the valid page and the soft-404 page end up at the same valid page, but the former requires two redirects while the latter requires three.

Even if w_(r)≠w_(u) (step 152 in FIG. 4), it is still possible that u is a soft-404 page, because on some hosts each soft-404 page is redirected to a unique address (http://www.amazon.com, for example). Thus, the contents of w_(r) and w_(u), and the parameters K_(r) and K_(u), are next examined at step 160. If w_(r)≠w_(u), K_(r)=K_(u), and T_(u) and T_(r) are identical or nearly identical (near-identity can be checked via shingling [8]), the algorithm declares u to be a soft-404 page (step 162; Lines 15-16 of DeadPage). If not, the page is declared to be alive at step 164. Note that testing near-identity (as opposed to complete identity) may be important, because sometimes the web server embeds the non-existent URL u in the text of the page it returns or makes other minor changes.
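The decision rules of this and the preceding paragraphs can be condensed as follows (a sketch; Appendix A gives the full pseudo code). The fetch, shingle, and is_site_root callables are assumed inputs here: fetch returns the final URL, its contents, the redirect count, and an error flag, and shingle is any near-duplicate fingerprint in the spirit of [8].

    def is_dead(u, fetch, shingle, is_site_root) -> bool:
        """Soft-404-aware dead-page test, mirroring the DeadPage logic."""
        w_u, t_u, k_u, err_u = fetch(u)
        if err_u:
            return True                       # hard-404: the fetch of u failed
        r = random_probe_url(parent_of(u))    # probe page that should not exist
        w_r, t_r, k_r, err_r = fetch(r)
        if err_r:
            return False                      # server returns hard-404s, so u is alive
        if is_site_root(u):
            return False                      # heuristic: a site root is never a soft-404
        if k_u != k_r:
            return False                      # server treats u and the probe differently
        if w_u == w_r:
            return True                       # u behaves exactly like the dead probe
        if shingle(t_u) == shingle(t_r):
            return True                       # nearly identical substitute content
        return False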

A computer system for practicing the methods of the present invention is depicted in simplified form in FIG. 5. The data processing system 200 includes at least one data processor 201 coupled to a bus 202 through which the data processor may address a memory sub-system 203, also referred to herein simply as “memory” 203. The memory 203 may include RAM, ROM and fixed and removable disks and/or tape. The memory 203 is assumed to store at least one program comprising instructions for causing the processor 201 to execute methods in accordance with the present invention. Also stored in memory 203 is at least one database 204.

The data processor 201 is also coupled through the bus 202 to a user interface, preferably a graphical user interface (“GUI”) 205 that includes a user input device 205A, such as one or more of a keyboard, a mouse, a trackball, or a voice recognition interface, as well as a user display device 205B, such as a high resolution graphical CRT display terminal, an LCD display terminal, or any suitable display device. With these input/output devices, a user can initiate operations to determine the currency or staleness of a web page.

The data processor 201 may also be coupled through the bus 202 to a network interface 206 that provides bidirectional access to a data communications network 207, such as an intranet and/or the internet. In various embodiments of the present invention, a host 208 containing web pages to be tested can be accessed over the internet through server 209.

In general, these teachings may be implemented using at least one software program running on a personal computer, a server, a microcomputer, a mainframe computer, a portable computer, an embedded computer, or by any suitable type of programmable data processor 201. Further, a program of machine-readable instructions capable of performing operations in accordance with the present invention may be tangibly embodied in a signal-bearing medium, such as a CD-ROM.

The above scheme attempts to capture as many of the cases of soft-404 pages as possible. There are other instances of soft-404 errors that need to be detected, for example, when the root of a web site is, in fact, a soft-404 page. An emerging phenomenon on the web is that of “parked web sites”. These are dead sites whose address was re-registered to a third party. The third party puts a redirect from those dead sites to its own web site. The idea is to profit from the prior promotional work of the previous owners of the dead sites. A report by Edelman [15] gives a nice description of this phenomenon as well as a case study of a specific example.

Let n be the total number of pages. Let D ⊂ [n] be the set of all dead pages, and let all other pages be live. Let M be the n×n matrix of the multi-graph of links among pages, so that M_(ij) is the number of links on page i to page j. To begin, one modification is performed on the matrix: M←M+I, adding a self loop to each page. A measure D_(σ)(i) will be defined in terms of a “success parameter” σ ∈ (0, 1]. (In experiments, σ=0.1 is selected.)

First, decay is described as a random process. Next, it is given a formal recursive definition, and finally, it is cast as a random walk in a Markov chain.

The measure can be seen as a random process governing a “web surfer” as follows. Initially, the current page p is set to i, the page whose decay is being computed (step 200 in FIG. 6). The surfer at the current page will perform the following steps, eventually returning a binary decay score depending on the random choices made during execution of the steps; the process therefore defines a distribution over {0, 1}. The decay D_(σ)(i) is the mean of this distribution.

1. If p ∈ D, the surfer terminates with decay value 1: the page is completely decayed (Steps 212 and 214 in FIG. 6).

2. Otherwise the result is “no” (Step 216 in FIG. 6), and the surfer flips a biased coin at step 220, and with probability σ decides that the content of the current page meets his information need (Step 230 in FIG. 6), and hence terminates successfully with decay score 0 (Step 234 in FIG. 6).

3. With the remaining probability 1−σ, the surfer chooses an outlink of p uniformly at random (Step 236 in FIG. 6), sets p to be the destination of that outlink, and begins again from step 200.
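The following is a minimal sketch, in Python, of the surfer process just described and of the sampling estimator discussed later in this description; the graph representation (a dict from page to outlink list) and the set of dead pages are assumed inputs, not part of the specification.

    import random

    def decay_trial(page, graph, dead, sigma=0.1):
        """One surfer walk; returns 1 if absorbed at a dead page, 0 on success."""
        while True:
            if page in dead:
                return 1                       # step 1: the page is decayed
            if random.random() < sigma:
                return 0                       # step 2: information need met
            # step 3: follow a uniformly random outlink; the M <- M + I
            # modification guarantees every page has at least a self loop.
            page = random.choice(graph[page])

    def decay_estimate(page, graph, dead, sigma=0.1, trials=300):
        """Mean over independent trials (300 trials are used in the experiments)."""
        return sum(decay_trial(page, graph, dead, sigma) for _ in range(trials)) / trials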

Unrolling this definition a few steps, it becomes clear that the decay of a page is influenced by dead pages a few steps away, but that the influence of a single path decreases exponentially with the length of the path. For example, a dead page has decay 1, a live page whose outlinks are all dead has decay 1−σ, a live page all of whose outlinks point to live pages that in turn point only to dead pages has decay (1−σ)², etc.

Now, a formal definition of the decay measure is given. Recursively, D_(σ) is defined as follows:

D_(σ)(i) = 1, if i ∈ D;

D_(σ)(i) = (1−σ) · ( Σ_(j∈[n]) M_(ij) D_(σ)(j) ) / ( Σ_(j∈[n]) M_(ij) ), otherwise.

Understanding the solution to this recursive formulation is easiest in the context of random walks, as described below.
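As an illustration only (not part of the specification), the recurrence can be solved by fixed-point iteration on a small graph; here M is the link matrix after the M←M+I modification, and the two-page example has page 1 dead.

    def decay_scores(M, dead, sigma=0.1, iters=100):
        """Fixed-point iteration of the recursive definition of D_sigma."""
        n = len(M)
        d = [1.0 if i in dead else 0.0 for i in range(n)]
        for _ in range(iters):
            d = [1.0 if i in dead else
                 (1 - sigma) * sum(M[i][j] * d[j] for j in range(n)) / sum(M[i])
                 for i in range(n)]
        return d

    # Page 0 links to itself (self loop) and to dead page 1.
    M = [[1, 1], [0, 1]]
    print(decay_scores(M, dead={1}))  # approximately [0.818, 1.0]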

Decay scores may also be viewed as absorption probabilities in a random walk. A Markov chain in which this walk takes place is now defined. First, the incidence matrix of the web graph must be normalized to be row stochastic (each nonzero element is divided by its row sum). Next, two new states must be added to the chain, each of which has a single outlink to itself: n+1 is the success state, and n+2 is the failure state. Thus these two new states are absorbing. Finally, the following two modifications are made to the matrix: first, each dead state is modified to have a single outlink with probability 1 to the failure state; second, all edges from non-dead states ([n]\D) are multiplied by 1−σ in probability, and a new edge with probability σ is added to the success state. Hence the two new states are the only two absorbing states of the chain, and any random walk in this chain will eventually be absorbed in one of the two states. Walks in this new chain mirror the random process described above, and the decay of page i is the probability of absorption in the failure state when starting from state i.
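For the same two-page example, the absorption probabilities can also be computed exactly by solving a linear system; the sketch below (an illustration, assuming numpy is available, and not the specification's prescribed method) restricts the row-stochastic matrix to live states and solves (I − Q)d = b, where b collects the one-step transitions into dead pages.

    import numpy as np

    def decay_absorbing(M, dead, sigma=0.1):
        """Exact decay scores as absorption probabilities in the failure state."""
        M = np.asarray(M, dtype=float)
        n = M.shape[0]
        P = M / M.sum(axis=1, keepdims=True)      # row-stochastic transitions
        live = [i for i in range(n) if i not in dead]
        Q = (1 - sigma) * P[np.ix_(live, live)]   # live-to-live, damped by 1 - sigma
        b = (1 - sigma) * P[np.ix_(live, sorted(dead))].sum(axis=1)
        d = np.ones(n)                            # dead pages have decay 1
        d[live] = np.linalg.solve(np.eye(len(live)) - Q, b)
        return d

    print(decay_absorbing([[1, 1], [0, 1]], dead={1}))  # approximately [0.818, 1.0]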

Global static ranking measures such as PageRank [7] usually have to be computed globally for the entire graph during a lengthy batch process. Other graph-oriented measures such as HITS [21] may be computed on-the-fly, but require inlink information typically derived from a complete representation of the web graph, such as [4], or from a large scale search engine that makes available information about the inlinks of a page.

Decay, on the other hand, is defined purely in terms of the out-neighbors of i. The following observation can be made:

OBSERVATION 1. The decay value of a page can be approximated to within constant accuracy in a constant number of HTTP fetches, independent of the link structure of the graph, without access to any other supporting indexes.

Such an implementation mirrors the random process definition of decay set forth previously. Because the walk terminates with probability at least σ at each step, the distribution over the number of steps is bounded above by the geometric distribution with parameter σ; thus, the expected number of steps for a single trial is no more than 1/σ, and the probability of long trials is exponentially small. Further, the value of each trial is 0 or 1, and so decay can be estimated to within error ε with probability 1−δ in O(1/ε² log 1/δ) trials; this follows from standard Chernoff bounds. (In practice, 300 trials are employed to estimate the decay value of each page.)

An alternative method operating in accordance with the present invention for assessing the decay of a web page is depicted in FIG. 7. At step 250, a subject web page containing hyperlinks is accessed over the internet. Then, at step 251, the decay of the subject web page is assessed by following a random walk away from the subject web page, where the random walk consists of a testing of links on web pages linked from the subject web page under test. In variants of this embodiment, the links being tested may be on web pages directly linked to the subject web page whose decay status is being tested, or may be on web pages linked to the subject web page by an arbitrary number of intermediate web pages and hyperlinks. Then, at step 252, a decay score is assigned to the subject web page in dependence on dead links encountered in the random walk, wherein the decay score is a weighted sliding scale, where a dead link encountered relatively close in the random walk to the subject web page results in a higher decay score than a dead link encountered relatively farther away from the subject web page.

Like other measures, decay is also amenable to the more traditional batch computation; it is expected to require a time similar to the time required by PageRank.

Next, the algorithm for identifying dead pages and the random walk algorithm for estimating the decay score of a given page were implemented. Then several sets of experiments described below were run. The first set of experiments validated that the decay measure set forth previously is a reasonable measure for the decay of web pages. Next, it was compared to another plausible measure, namely, the fraction of dead links on a page. After establishing that the present decay measure is reasonable, it was used to discover interesting facts about the web.

In this section the settings of the parameters for the two algorithms that were used in the experiments are described. The parameters of the algorithm for detecting dead pages were set as follows:

-   A timeout of T=10 seconds was allowed for fetching a page. If the server does not respond within 10 seconds, the page is declared dead.

-   At most L=20 redirects are allowed for a page. If more than 20 redirects are encountered, the page is declared dead.

-   To create a random URL in the same directory as the page, the parent directory is appended with a sequence of 25 random lower-case Latin letters.

The parameters of the random walk algorithm were set as follows:

-   In general, a success parameter σ=0.1 is used. Thus, at each step of the random walk, with probability 0.1, the random walk proceeds to the success absorbing state. The expected length of a random walk is then at most 10.

-   For each page, the random walk algorithm is run 300 times. This guarantees an additive error in the decay measure estimates of at most 0.1 with confidence at least 0.8.

On average, getting the decay score of a page took about 7 minutes on a machine with dual 1.6 GHz AMD processors, 3 GB of main memory, running a Linux operating system and having a 100 Mbps connection to the network. Since the task was highly parallelizable (the decay scores of different pages could be estimated in parallel, and different random walks for the same page could also be run in parallel), about 10 random walk processes were run simultaneously in order to increase throughput.

The first experiment involved computing the decay score and the fraction of dead links on 1000 randomly chosen pages. The pages were chosen from a two billion page crawl performed largely in the last four months.

To begin with, of the 1000 pages, 475 were already dead (substantiating the claim that web pages have short half-lives, on average). For each remaining page, its decay score was computed as well as the fraction of its dead links. In total, there were 710 dead links on the pages and, out of these, 207 were pointing to soft-404 pages (roughly 29%). Moreover, the random walks during the decay score computation of the 525 pages encountered a total of 22,504 dead links, out of which 6,060 pointed to soft-404 pages (roughly 27%). Another interesting statistic is that only 350 of the 525 live pages had a non-empty “Last-Modified” date.

The main statistic emerging from this experiment is that the average fraction of dead links is 0.068, whereas the average decay score of a live page with at least one outlink is 0.168, 0.106, 0.072, and 0.041 for values of σ=0.1, 0.2, 0.33 and 0.5, respectively.

The decay curves in FIG. 8 reflect the fact that for a given page i, if σ₁≧σ₂, then D_(σ1)(i)≦D_(σ2)(i). Proof: The decay is the probability of absorption into the failure state. Consider all paths that lead to the failure state. The weight of each individual path under σ₁ is less than or equal to its weight under σ₂; namely, for a path of length k it is (1−σ)^(k) times the unbiased random walk weight of the path. (The same argument does not work for the paths that lead to the success state; their individual weight is not monotonic in σ.)
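The argument can be written out explicitly (a sketch in standard notation; the walks π below may revisit pages and reach a dead page for the first time only at their final step):

    \[
      D_\sigma(i) \;=\; \sum_{\pi}\, (1-\sigma)^{|\pi|}\, w(\pi),
    \]

where the sum ranges over all such walks π from i to a dead page, |π| is the number of steps in π, and w(π) is the probability of π under the unbiased random walk. For every walk, (1−σ₁)^|π| ≦ (1−σ₂)^|π| whenever σ₁ ≧ σ₂, and summing over walks gives D_(σ1)(i) ≦ D_(σ2)(i).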

For the rest of the description, σ=0.1 is used.

Clearly, the decay and the fraction of dead links are related, but not in a simple way. More precisely, if F(i) is the fraction of dead links on page i, and page i is not dead, then

D_(σ)(i) = (1−σ) · ( F(i) + (1−F(i)) · A(i) ),   (1)

where A(i) is the average decay of the non-dead neighbors of i.

FIG. 8 shows that the distributions of the decay score and the fraction of dead links intersect. The difference between them can also be seen from the scatter plot of these distributions for σ=0.1 (FIG. 9). The scatter plot shows that the decay score is generally more than the fraction of dead links. (This also follows from equation (1).) More interestingly, it also shows that the decay measure can be close to 0.5 even when the fraction of the dead links is close to 0.

The next experiments to be described concern papers from the last ten World Wide Web conferences. All of the (refereed track) papers from WWW3 to WWW12 were crawled, and for each paper with at least one outlink, its decay score and the fraction of its dead links were computed. The averaged results are shown in FIG. 10. The main observation is the following: the trend exhibited by decay scores is more representative and more useful than that of the fraction of dead links. From the figure, it is evident that the decay scores decline as the conferences get more recent; on the other hand, the fraction of dead links exhibits a flatter trend. It is arguably the case that, on average, links contained in papers from older conferences not only have a higher chance of themselves being dead, but also are more likely to point to pages that are dead. Decay scores are therefore able to better reflect the temporal aspect of hyperlink creation and maintenance; it is believed this feature might have other applications.

The next experiment performed involved a set of 30 nodes from the current Yahoo! ontology (Appendix B). The nodes were chosen so as to have a relatively large number of outside links and to be well represented in the Internet Archive (www.archive.org). The decay score and fraction of dead links were computed for each of the 30 nodes. The Internet Archive was used to fetch the previous incarnations of the same nodes over the past five years, and the decay scores and fractions of dead links were computed for these “old” pages as well. Since the archived pages have time stamps embedded in the URL, at the end of this step, a history of decay scores and fractions of dead links for each leaf was obtained. These scores were averaged over the 30 nodes and the time line bucketed into months (since 1998) to obtain FIG. 11.

The behaviors of the decay scores and the fraction of dead links are still different; but the important point is that this difference in behavior is itself different from that observed for the WWW conferences (FIG. 10). Unlike in the WWW conference case, here the decay score is flatter whereas the fraction of dead links is rapidly decreasing. The behavior of the dead links is as expected—the fraction of dead links is close to 0 in the current version of the Yahoo! nodes; this is obviously due to their automatic filtering of dead links. But, even in the current version of these nodes, the figure shows that the decay score is as high as that of a random web page (i.e., close to 0.2).

Thus, it can be concluded that many of the pages pointed to by Yahoo! nodes, even though they are not dead themselves yet, are littered with dead links and outdated. For example, consider the Yahoo! category Health/Nursing. Only three out of 77 links on this page are dead. However, the decay score of this page is 0.19. A few examples of dead pages that can be reached by browsing from the above Yahoo! page are: (1) the page http://www.geocities.com/Athens/4656/ has an ECG tutorial where all the links are dead; (2) the page http://virtualnurse.com/er/er.html has many dead links; (3) many of the links in the menu bar of http://www.nursinglife.com/index.php?n=1&id1 are dead; and so on. It is believed that using decay scores in an automatic filtering system would improve the overall quality of links in a taxonomy like Yahoo!.

The final set of experiments to be described involved the frequently asked questions (FAQs) obtained from www.faqs.org. All 3,803 FAQs were collected, and decay scores and the fraction of dead links were computed for each of them. The last modified/last updated date for the FAQs was computed by explicitly parsing the FAQ (since the last modified date returned in the HTTP header from www.faqs.org does not represent the actual date when the FAQ was last modified/updated). As in the earlier case, the results were collated and the time line bucketed into years since 1992 to obtain FIG. 12.

From the figure, it is clear that despite the fact that the FAQs are hand-maintained in a distributed fashion by a number of diverse and unrelated people, they suffer from the same problem—many pages pointed to by FAQs are unmaintained.

A number of application areas could fruitfully apply the decay concept:

(1) Webmaster and ontologist tools: There are a number of tools made available to help webmasters and ontologists track dead links on their sites; however, for web sites that maintain resources, there are no tools to help understand whether the linked-to resources are decayed. The observation about Yahoo! leaf nodes suggests that such tools might provide an automatic or semi-automatic approach to addressing the decay problem.

(2) Ranking: Decay measures have not been used in ranking, but users routinely complain about search results pointing to pages that either do not exist (dead pages), or exist but do not reference valid current information (decayed pages). Incorporating the decay measure into the rank computation would alleviate this problem. Furthermore, web search engines could use the soft-404 detection algorithm to eliminate soft-404 pages from their corpus. Note that soft-404 pages indexed under their new content are still problematic, since most search engines put a substantial weight on anchor text, and the anchor text to soft-404 pages is likely to be quite wrong.

(3) Crawling: The decay score can be used to guide the crawling process and the frequency of the crawl, in particular for topic-sensitive crawling [12]. For instance, one can argue that it is not worthwhile to frequently crawl a portion of the web that has sufficiently decayed; as seen in the described experiments, very few pages have valid last-modified dates in them. The on-the-fly random walk algorithm for computing the decay score might be too expensive to assist this decision at crawl-time, but after a global crawl one can compute the decay scores of all pages on the web at the same cost as PageRank. Heavily decayed pages can then be crawled infrequently.

(4) Web sociology and economics: Measuring the decay score of a topic can give an idea of the “trendiness” of the topic.

Thus it is seen that the foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the best methods and apparatus presently contemplated by the inventors for assessing the currency or staleness of web pages. One skilled in the art will appreciate that the various embodiments described herein can be practiced individually; in combination with one or more other embodiments described herein; or in combination with methods and apparatus differing somewhat from those described herein. Further, one skilled in the art will appreciate that the present invention can be practiced by other than the described embodiments; that these described embodiments are presented for the purposes of illustration and not of limitation; and that the present invention is therefore limited only by the claims which follow.

[1] W. Aiello, F. Chung, and L. Lu. A random graph model for power law graphs. Experimental Mathematics, 10:53-66, 2001.

[2] Z. Bar-Yossef, A. Berg, S. Chien, J. Fakcharoenphol, and D. Weitz. Approximating aggregate queries about web pages via random walks. In Proceedings of the 26th International Conference on Very Large Databases, pages 535-544, 2000.

[3] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509-512, 1999.

[4] K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The connectivity server: Fast access to linkage information on the Web. In Proceedings of the 7th International World Wide Web Conference, pages 104-111, 1998.

[5] K. Bharat and M. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104-111, 1998.

[6] B. Brewington and G. Cybenko. How dynamic is the web? In Proceedings of the Ninth International World Wide Web Conference, pages 257-276, May 2000.

[7] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th International World Wide Web Conference, pages 107-117, 1998.

[8] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the Web. In Proceedings of the 6th International World Wide Web Conference, pages 391-404, 1997.

[9] A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. WWW9/Computer Networks, 33(1-6):309-320, 2000.

[10] A. Z. Broder, R. Lempel, F. Maghoul, and J. Pedersen. Efficient PageRank approximation via graph aggregation. Manuscript.

[11] S. Chakrabarti, B. Dom, D. Gibson, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Spectral filtering for resource discovery. In Proceedings of the ACM SIGIR Workshop on Hypertext Analysis, pages 13-21, 1998.

[12] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: a new approach to topic-specific web resource discovery. WWW8/Computer Networks, 31(11-16):1623-1640, 1999.

[13] J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. In Proceedings of the 26th International Conference on Very Large Databases, pages 200-209, 2000.

[14] F. Douglis, A. Feldmann, B. Krishnamurthy, and J. C. Mogul. Rate of change and other metrics: a live study of the world wide web. In USENIX Symposium on Internet Technologies and Systems, 1997.

[15] B. Edelman. Domains reregistered for distribution of unrelated content: A case study of “Tina's Free Live Webcam”. http://cyber.law.harvard.edu/people/edelman/renewals/, 2002.

[16] D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study of the evolution of web pages. In Proceedings of the 12th International World Wide Web Conference, pages 669-678, 2003.

[17] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. RFC2616: Hypertext Transfer Protocol—HTTP/1.1. http://www.w3.org/Protocols/rfc2616/rfc2616.html, June 1999.

[18] T. Haveliwala. Topic-sensitive PageRank. In Proceedings of the 11th International World Wide Web Conference, pages 517-526, 2002.

[19] M. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near-uniform URL sampling. WWW9/Computer Networks, 33(1-6):295-308, 2000.

[20] A. Jesdanun. Internet littered with dead web sites. http://story.news.yahoo.com/news?tmpl=story&n=/ap/20031102/ap_on_hi_te/deadwood_online_1, November 2003.

[21] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.

[22] W. Koehler. An analysis of web page and web site constancy and permanence. Journal of the American Society for Information Science, 50(2):162-180, 1999.

[23] W. Koehler. Digital libraries and world wide web sites and page persistence. Information Research, 4(4), 1999.

[24] K. Kokoszkiewicz (a.k.a. Alectorides Conradus). Vocabula Computatralia Anglico-Latinum. University of Warsaw, Centre for Studies on the Classical Tradition in Poland and East-Central Europe (OBTA). http://www.obta.uw.edu.pl/~draco/docs/voccomp.html.

[25] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. In Proceedings of the 41st IEEE Annual Symposium on Foundations of Computer Science, pages 57-65, 2000.

[26] J. Markwell and D. W. Brooks. Broken links: The ephemeral nature of educational WWW hyperlinks. Journal of Science Education and Technology, 11(2):105-108, 2002.

[27] J. Markwell and D. W. Brooks. “Link rot” limits the usefulness of web-based educational materials in biochemistry and molecular biology. Biochemistry and Molecular Biology Education, 31(1):69-72, 2003.

[28] A. Ntoulas, J. Cho, and C. Olston. What's new on the web? The evolution of the web from a search engine perspective. In Proceedings of the 13th International World Wide Web Conference, 2004.

[29] G. Pandurangan, P. Raghavan, and E. Upfal. Using PageRank to characterize web structure. In Computing and Combinatorics: 8th Annual International Conference, pages 330-339, 2002.

[30] P. Rusmevichientong, D. M. Pennock, S. Lawrence, and C. L. Giles. Methods for sampling pages uniformly from the world wide web. In Proceedings of the AAAI Fall Symposium on Using Uncertainty Within Computation, pages 121-128, 2001.

- [31] J. L. Wolf, M. S. Squillante, P. S. Yu, J. Sethuraman, and L. Ozsen. Optimal crawling strategies for web search engines. In Proceedings of the 11th International World Wide Web Conference, pages 136-147, 2002.

APPENDIX A

Function DeadPage(u) returns bool
in: URL u
  string T_u, T_r; int K_u, K_r; bool error
  fetch(u, T_u, w_u, K_u, error)
  if (error) then return true                        // a hard 404 error
  URL r := u.PARENT + 25 random characters
  fetch(r, T_r, w_r, K_r, error)
  if (error) then return false                       // host returns a hard 404 on dead pages
  if (u is the root of u.HOST) then return false     // a root cannot be a soft 404
  if (K_u ≠ K_r) then return false                   // different number of redirects
  if (w_u = w_r) then return true                    // same redirects and same number of redirects
  if (shingle(T_u) = shingle(T_r)) then return true  // almost-identical content
  return false                                       // not a soft-404 page

Function fetch(u, T_u, w_u, K_u, error)
in: URL u
out: string T_u, URL w_u, int K_u, bool error
  w_u := u; K_u := 0
  set<URL> redirects; redirects.insert(u)
  while (true) do
    URL v; bool redirect
    atomicFetch(w_u, T_u, v, redirect, error)
    if (error) then return                           // a hard 404
    if (!redirect) then return                       // no more redirects
    if (redirects.find(v)) then { error := true; return }  // a redirect loop
    if (K_u ≥ 20) then { error := true; return }     // too many redirects
    w_u := v; K_u := K_u + 1; redirects.insert(v)    // record v so the loop check can fire
  end while

Function atomicFetch(w, T, v, redirect, error)
in: URL w
out: string T, URL v, bool redirect, bool error
  parse(w, error)
  if (error) then return                             // parsing the URL failed
  IPAddress address
  getIPAddress(w.HOST, address, error)
  if (error) then return                             // resolving the host's IP address failed
  HTTPRetCode code
  httpGet(address, T, v, code, timeout = 10 sec, error)
  if (error) then return                             // the HTTP request timed out
  if (code in {403, 404, 410, 5xx}) then { error := true; return }  // bad HTTP return code
  if (code in {3xx}) then redirect := true else redirect := false
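
For concreteness, the following Python sketch implements the same soft-404 test, assuming the third-party requests library. The helper names, the MD5-hashed eight-word shingles, and the treatment of any network failure as a hard 404 are illustrative choices of this sketch, not requirements of the method.

    import hashlib
    import random
    import string
    from urllib.parse import urlparse

    import requests

    SHINGLE_WORDS = 8  # shingle width; an assumed value, not fixed by Appendix A

    def shingles(text):
        # Set of hashed overlapping word windows ("shingles") of the page text.
        words = text.split()
        count = max(1, len(words) - SHINGLE_WORDS + 1)
        return {hashlib.md5(" ".join(words[i:i + SHINGLE_WORDS]).encode("utf-8")).hexdigest()
                for i in range(count)}

    def fetch(url):
        # Follow redirects; return (final_url, text, redirect_count, error).
        try:
            r = requests.get(url, timeout=10, allow_redirects=True)
        except requests.RequestException:
            return None, None, 0, True   # timeout, DNS failure, or redirect loop
        if r.status_code in (403, 404, 410) or r.status_code >= 500:
            return None, None, 0, True   # hard 404: bad HTTP return code
        return r.url, r.text, len(r.history), False

    def dead_page(url):
        # True if url is a hard 404, or a soft 404 judged by a sibling probe.
        w_u, t_u, k_u, error = fetch(url)
        if error:
            return True                  # a hard 404 error
        if urlparse(url).path in ("", "/"):
            return False                 # a site root cannot be a soft 404
        parent = url.rsplit("/", 1)[0] + "/"
        probe = parent + "".join(random.choices(string.ascii_lowercase, k=25))
        w_r, t_r, k_r, error = fetch(probe)
        if error:
            return False                 # host returns a proper hard 404 on dead pages
        if k_u != k_r:
            return False                 # different number of redirects
        if w_u == w_r:
            return True                  # both URLs land on the same error page
        if shingles(t_u) == shingles(t_r):
            return True                  # almost-identical content
        return False                     # not a soft-404 page

As in the pseudocode, the two shingle sets are compared for exact equality; a deployed implementation would more plausibly declare a soft 404 whenever the overlap between the two sets is merely high.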

APPENDIX B

1. Business_and_Economy/Classifieds

2. Business_and_Economy/Employment_and_Work/Organizations

3. Computers_and_Internet/News_and_Media/Magazines

4. Computers_and_Internet/News_and_Media/Magazines

5. News_and_Media/Journalism

6. News_and_Media/Television/Satellite

7. Entertainment/Music/Band_Naming

8. Entertainment/Humor

9. Recreation/Automotive

10. Recreation/Gambling

11. Health/Medicine

12. Health/Nursing

13. Health/Fitness

14. Government/Military/Weapons_and_Equipment

15. Government/Law

16. Regional/U_S_States/California/Education

17. Regional/Countries/France/Arts_and_Humanities/Museums_Galleries_and_Centers

18. Society_and_Culture/Environment_and_Nature

19. Society_and_Culture/Food_and_Drink/Cooking


20. Society_and_Culture/Death_and_Dying

21. Education/Higher_Education

22. Education/K_12/Gifted_Youth/Schools

23. Arts/Visual_Arts/Photography/Digital

24. Arts/Humanities/Literature/Poetry

25. Science/Computer_Science/Electronic_Computer_Aided_Design_ECAD_

26. Science/Biology/Zoology/Animals_Insects_and_Pets/Pets/Health

27. Social_Science/Psychology/Branches/Sleep_and_Dreams

28. Social_Science/Anthropology_and_Archaeology

29. Reference/Quotations

30. Reference/Dictionaries

1.-3. (canceled)
4. A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus of a computer system to perform operations for assessing the currency of a web page, the operations comprising: receiving a user-specified topicality threshold, where the topicality threshold concerns the topicality of material content of the web page; accessing a web page; extracting topicality information from the web page; and comparing the topicality information extracted from the web page to the topicality threshold.
5. The signal-bearing medium of claim 4 further comprising: identifying the web page as lacking currency if the topicality information extracted from the web page lacks topicality when compared to the topicality threshold.
6. The signal-bearing medium of claim 4 further comprising: identifying the web page as being current if the topicality information extracted from the web page is topical when compared to the topicality threshold.

7.-30. (canceled)
31. A computer system for assessing the currency of a web page, the computer system comprising: an internet connection for connecting to the internet and for accessing web pages available on the internet; at least one memory to store web pages retrieved from the internet and at least one program of machine-readable instructions, where the at least one program performs operations to assess the currency of a web page; and at least one processor coupled to the internet connection and the at least one memory, where the at least one processor performs the following operations when the at least one program is executed: retrieving a predetermined topicality threshold, where the topicality threshold concerns the topicality of material comprising a web page; extracting topicality information from the web page; and comparing the topicality information extracted from the web page to the topicality threshold.

32. The computer system of claim 31 where the operations further comprise: identifying the web page as lacking currency if the topicality information extracted from the web page lacks topicality when compared to the topicality threshold.
33. The computer system of claim 31 where the operations further comprise: identifying the web page as being current if the topicality information extracted from the web page is topical when compared to the topicality threshold.

34.-54. (canceled)

55. A computer-implemented method for assessing the currency of a web page, the method comprising: receiving a user-specified topicality threshold, where the topicality threshold concerns the topicality of material content of the web page; accessing a web page; extracting topicality information from the web page; and comparing the topicality information extracted from the web page to the topicality threshold.

56. The computer-implemented method of claim 55 further comprising: identifying the web page as lacking currency if the topicality information extracted from the web page lacks topicality when compared to the topicality threshold.
57. The computer-implemented method of claim 55 further comprising: identifying the web page as being current if the topicality information extracted from the web page is topical when compared to the topicality threshold.
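
To make the claimed operations concrete, the Python sketch below scores the topicality of a page as the fraction of its outgoing hyperlinks that are still alive, in the spirit of the link-status measures described above. The function names, the regular-expression link extraction, the HEAD-based liveness probe, and the choice of live-link fraction as the topicality information are illustrative assumptions, since the claims do not fix a particular measure; the third-party requests library is assumed.

    import re

    import requests

    LINK_RE = re.compile(r'href="(https?://[^"]+)"', re.IGNORECASE)

    def link_is_alive(url):
        # Probe with HEAD; treat any response code below 400 as alive.
        try:
            r = requests.head(url, timeout=10, allow_redirects=True)
        except requests.RequestException:
            return False
        return r.status_code < 400

    def topicality_score(page_url):
        # Extract topicality information: the fraction of outgoing links that resolve.
        html = requests.get(page_url, timeout=10).text
        links = LINK_RE.findall(html)
        if not links:
            return 1.0   # no outgoing links, so none have decayed
        return sum(link_is_alive(u) for u in links) / len(links)

    def assess_currency(page_url, threshold):
        # Compare the extracted topicality information to the received threshold.
        return "current" if topicality_score(page_url) >= threshold else "lacking currency"

For example, assess_currency(url, 0.8) reports a page as lacking currency when fewer than 80% of its extracted links still resolve.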